Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation

Neural Information Processing Systems

The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D.
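As a rough illustration of what encoding a spatial prior as a two-dimensional Gaussian field on a map grid could look like, here is a minimal NumPy sketch. The function name `gaussian_field`, the grid size, the center, the `sigma` value, and the use of an element-wise maximum to inject the field into a target channel are all assumptions made for illustration; the abstract does not specify how GOAL implements this step.

```python
import numpy as np

def gaussian_field(height, width, center, sigma):
    """Return an unnormalized 2D Gaussian bump on a (height, width) grid.

    Hypothetical helper: `center` is (row, col) and `sigma` is in grid cells.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = center
    sq_dist = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Assumed example: one semantic channel of a 64x64 target map.
target_channel = np.zeros((64, 64))
field = gaussian_field(64, 64, center=(20, 30), sigma=5.0)

# Assumed injection rule: keep the stronger of the existing value and the prior.
target_channel = np.maximum(target_channel, field)
```

The field peaks at 1.0 at the chosen center and decays smoothly with distance, so a prior of this kind expresses a soft "likely nearby" region rather than a hard placement.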








Appendix: Language and Visual Entity Relationship Graph for Agent Navigation

Neural Information Processing Systems

Directional features: As in previous work [3, 6, 10], we apply a 128-dimensional directional encoding by replicating (cos θ_i, sin θ_i, cos φ_i, sin φ_i) 32 times to represent the orientation of each single view i with respect to the agent's current orientation, where θ_i and φ_i are the heading and elevation angles to that view. Replicating the encoding 32 times does not enrich its information but makes its gradient 32 times larger during back-propagation. We suspect that this helps the agent learn about the action-related terms (e.g.
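The directional encoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `directional_encoding` is hypothetical, angles are assumed to be in radians, and the copies are assumed to be laid out as 32 consecutive repeats of the 4-vector (the excerpt does not state the ordering).

```python
import numpy as np

def directional_encoding(heading, elevation, repeats=32):
    """Build a (4 * repeats)-dimensional encoding of a view direction.

    Hypothetical helper: `heading` and `elevation` are angles in radians,
    measured relative to the agent's current orientation.
    """
    base = np.array([
        np.cos(heading), np.sin(heading),
        np.cos(elevation), np.sin(elevation),
    ])
    # Assumed layout: the 4-vector repeated back to back `repeats` times.
    return np.tile(base, repeats)

# A view 90 degrees to the side, at the agent's eye level.
enc = directional_encoding(np.pi / 2, 0.0)
```

With the default `repeats=32` the result has 128 entries, and every group of four entries carries the same values, which is why the replication adds no information on its own.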



Counterfactual Vision-and-Language Navigation: Unravelling the Unseen

Neural Information Processing Systems

A prominent challenge is to train an agent capable of generalising to new environments at test time, rather than one that simply memorises trajectories and visual details observed during training. We propose a new learning strategy that learns both from observations and generated counterfactual environments.